Abstract: Social web users are a very diverse group with varying
interests, levels of expertise, enthusiasm, and expressiveness.
As a result, the quality of content and annotations they create to
organize content is also highly variable. While several approaches
have been proposed to mine social annotations, for example,
to learn folksonomies that reflect how people relate narrower
concepts to broader ones, these methods treat all users and
the annotations they create uniformly. We propose a framework
to automatically identify experts, i.e., knowledgeable users who
create high quality annotations, and use their knowledge to guide
folksonomy learning. We evaluate the approach on a large body
of social annotations extracted from the photosharing site Flickr.
We show that using expert knowledge leads to more detailed
and accurate folksonomies. Moreover, we show that including
annotations from non-expert, or novice, users leads to more
comprehensive folksonomies than experts' knowledge alone.

I. Introduction

Knowledge production is no longer solely in the hands
of professionals: on many Social Web sites ordinary people
create and annotate a wide variety of content. On the social
photosharing site Flickr (http://flickr.com), for example, users
can publish photographs, tag them with descriptive keywords,
such as insect or macro, and organize them within personal
directories. While an individual's annotations express her
particular world view, collectively social annotations provide
valuable evidence for harvesting social knowledge, including
folksonomies (folk + taxonomies) that show how people
relate broader concepts to narrower ones. Social knowledge
is idiosyncratic and may at times conflict with knowledge
expressed in professionally curated taxonomies. For example,
many people consider spiders to be insects, at odds with
the Linnaean taxonomy of living organisms. However, such
knowledge is necessary to make sense of and leverage
user-generated content on the Social Web. Thus, to find all images
of spiders, you will sometimes have to look for insects.

Recently, Plangprasopchok et al. [1] proposed a method
to learn folksonomies by integrating structured annotations
from many users, specifically, personal directories created by
individual Flickr users to organize their photos. The method
extends affinity propagation [2] to use structural information
to concurrently combine many shallow personal directories
into a larger common taxonomy. The method assumes that
the quality of annotation from all users is the same. However,
Social Web users are highly diverse and vary in their
degree of expertise and expressiveness. Knowledgeable users
create high quality, detailed annotations, often using technical
terms. They specify intermediate concepts within multi-level
directories, e.g., linking jumping spider to spiders to
arachnids to invertebrates. We call such users experts.
Novice users, on the other hand, are far less expressive,
creating shallow directories that jump granularity levels, e.g.,
linking spiders to bugs. Using experts' knowledge enables
us to learn more accurate and detailed folksonomies.

Diversity is important for groups and organizations [3]. It
can lead to better group decision making and organizational
robustness [4], as long as individual knowledge and opinions
are aggregated correctly [5]. Hence, identifying experts from
the content they create, or from recommendations of other
people, has been an active research area. Previous works used
natural language analysis [6], [7] and topic modeling [8]
techniques to identify experts from the text of documents
they created, often combining it with analysis of the structure
of links within an organization [9], [10]. Annotations on the
Social Web can help identify diverse classes of users. However,
while previous researchers classified users based on their
annotation practices [11], they did not attempt to automatically
distinguish expert from novice users.

In this paper we propose methods to automatically identify
expert users who provide high quality annotations and leverage
their knowledge in folksonomy learning. First, in Section II
we describe and evaluate a method that examines structured
annotations to automatically identify expert users. Specifically,
our method analyzes the structure and content of personal
directories created by Flickr users. In Section III we extend
the inference method of Plangprasopchok et al. [1] to use
experts' knowledge to guide the folksonomy learning process.
In Section IV we show that the inference method that exploits
user diversity by putting greater weight on annotations created
by experts can learn more accurate and detailed folksonomies
than one that ignores diversity. Surprisingly, however, we
show that while experts' knowledge is required to learn more
accurate folksonomies, novice knowledge is needed to learn
more complete folksonomies. We also carry out a detailed
investigation of the robustness of our method.



II. Identifying Expert Users 

Experts are knowledgeable individuals who can answer
questions within organizations and generate high quality data.
Identifying such people is an important research topic in data
mining, management science, and social network analysis.
Researchers have proposed a variety of algorithms for automatic
expert identification, including language [7], probabilistic
topic-based [8] and statistical [6] models and network
analysis tools [12], [10], that identify experts based on the
documents or email messages they exchange within organizations.
Hybrid approaches that combine topics and relationships
between users [9] have also been explored.

Expert identification is even more important for mining 
user-generated content, since Social Web users form an ex- 
tremely diverse group, with widely varying levels of expertise 
and enthusiasm for different topics. As a result, the quality 
of data they create also varies tremendously. One way to 
differentiate data quality is by identifying expert users. We 
extend the features used to measure diversity in groups 
and use them within a supervised expert classification method. 
The features measure users' expertise based on the structure of 
annotations they create. Unlike previous works that examined 
(textual) data people create, our method looks directly at 
knowledge structures they express through annotations. 

A. Structured Annotations 

Social web sites allow users to annotate content they create 
or share with others. In addition to tagging content, some 
sites also allow users to organize it hierarchically. Del.icio.us 
users can group related tags into bundles, and Flickr users 
can group related photos into sets and then group related 
sets in collections, thereby creating personal directories to 
organize photos. The sites themselves do not impose any 
rules on the vocabulary or semantics of directories; in practice 
users employ them to represent relations between broader and 
narrower categories or concepts. Personal directories offer 
rich evidence for harvesting social knowledge and have been 
used to learn communal taxonomies of concepts, otherwise
known as folksonomies [13], [1].

Following Plangprasopchok et al. [13], we call a directory
a user creates to organize photos on Flickr a sapling.¹ The
root node of the sapling corresponds to a user's collection,
and inherits its name, while the leaves correspond to the
collection's constituent sets (or other collections) and inherit
their names. The photos the user assigns to a set are tagged,
and we propagate these tags to sets and to their parent
collections. While most users create shallow saplings consisting
of a top-level collection and constituent sets (see Fig. 1(a)),
others create detailed, multi-level hierarchies about a topic of
interest (Fig. 1(b)). We call the latter users experts and the
former novices.

¹ Saplings are not always tree-like. In these cases we convert them to trees.

Fig. 1. Saplings created by (a) novice and (b) expert users.

By manually inspecting saplings created by
Flickr users, we found that structure and semantic consistency
are two important factors distinguishing expert from novice
users. Specifically, we have identified the following hallmarks
of an expert:

• generally creates many saplings with distinct concepts 

• creates deep (> 2 levels as in Fig. 1(b)) or broad saplings

• provides top-level concepts that are meaningful to others. 
Overly-broad concepts, such as 'life', 'things', 'misc', 'all 
sets', etc., imply novice users 

• does not jump many levels (e.g., attach 'los angeles' to
'world') nor mix concepts of different granularity levels
(e.g., 'table mountain' and 'equatorial guinea' are never
siblings, as in Fig. 1(a))

• does not create conflicts (e.g., attach 'los angeles' to 
'journey' in one sapling while attaching 'journey' to 'los 
angeles' in another) 

• does not create multiple child concepts with same name 
(e.g., five 'los angeles' sets under 'journey'). 

B. Features 

To automate expert identification, we convert the observa- 
tions above into quantitative features. We divide the features 
into two classes: user-level and sapling-level features. 

1) User Features: Experts express a variety of concepts.

User-Variety measures the number of saplings (N) and
NumTwigs the number of relations a user creates.

User-Balance measures how uniform the saplings are in
size. We measure this by the normalized entropy
$B_u = \left(-\sum_{i=1}^{N} p_i \ln p_i\right) / \ln N$,
where $p_i$ is the number of nodes in sapling i divided by the
total number of nodes the user creates.
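As a rough illustration of User-Balance, the sketch below computes the entropy of a user's sapling sizes and normalizes it by ln N, so that a user whose saplings are all the same size scores 1. The function name, the choice of ln N as the normalizer, and the treatment of users with a single sapling are assumptions made for this example, not details taken from the original implementation.

```python
import math

def user_balance(sapling_sizes):
    """Normalized entropy of a user's sapling sizes (1.0 = perfectly uniform)."""
    n = len(sapling_sizes)
    if n < 2:
        return 0.0  # assumption: balance is undefined for one sapling, report 0
    total = sum(sapling_sizes)
    # p_i = nodes in sapling i / all nodes the user created
    probs = [size / total for size in sapling_sizes if size > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)

uniform = user_balance([5, 5, 5])    # three saplings of equal size
skewed = user_balance([13, 1, 1])    # one dominant sapling
print(uniform, skewed)
```

A user with one very large sapling and several tiny ones scores lower than a user whose saplings are of comparable size.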

Fig. 2. Frequency distribution of distinct children of root nodes (a) 'nature'
and (b) 'other stuff'.

User-Disparity measures differences between concepts expressed
in a user's saplings [3]. We compute disparity using the
Jensen-Shannon divergence between the tag distributions of
two saplings:

$$\sqrt{JS(\tau_i \| \tau_j)} = \sqrt{0.5\, D(\tau_i \| \tau_k) + 0.5\, D(\tau_j \| \tau_k)}, \qquad (1)$$

where $\tau_i$ represents the tag distribution of sapling i, $\tau_k = \frac{1}{2}(\tau_i + \tau_j)$,
and $D(\cdot)$ is the Kullback-Leibler divergence. DisparityNormalized
simply divides the above measure by the number of
nodes in the saplings.
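The disparity between two saplings can be sketched as follows. The example builds tag distributions from raw tag lists, forms their average, and combines the two Kullback-Leibler terms with equal weight. The function names and the list-of-tags input format are assumptions for illustration; a square root can be applied to the result if a distance is wanted.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), summed over the support of p."""
    return sum(pv * math.log(pv / q[tag]) for tag, pv in p.items() if pv > 0)

def js_divergence(tags_i, tags_j):
    """Jensen-Shannon divergence between two saplings' tag distributions."""
    ci, cj = Counter(tags_i), Counter(tags_j)
    ni, nj = sum(ci.values()), sum(cj.values())
    vocab = set(ci) | set(cj)
    p = {t: ci[t] / ni for t in vocab}           # tag distribution of sapling i
    q = {t: cj[t] / nj for t in vocab}           # tag distribution of sapling j
    m = {t: 0.5 * (p[t] + q[t]) for t in vocab}  # average of the two
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = js_divergence(["bird", "nest"], ["bird", "nest"])
disjoint = js_divergence(["bird"], ["bridge"])
print(same, disjoint)
```

Saplings with identical tag distributions have zero divergence, while saplings with no tags in common reach the maximum value of ln 2.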

2) Sapling Features: Experts express detailed knowledge
in particular topics, not necessarily all topics.

Sapling-Variety combines depth and breadth of the sapling:
$V_s = \sum_{i=1}^{L} i \times n_i$, where L is the depth of the sapling and
$n_i$ is the number of nodes at level i. This gives more credit
to deeper representations if both saplings are equally large.

Sapling-Balance measures how balanced the sapling is
at each level. We quantify balance by normalized entropy
based on the expected number of nodes at the current level given
the number of nodes at the previous level:
$B_s = \frac{1}{L} \times \sum_{i=1}^{L} \left( \left(-\sum_{j=1}^{n_i} p_j^i \ln p_j^i\right) / \ln(n_i) \right)$,
where $n_i$ is the number of nodes at level i and $p_j^i$ is the
proportion of children of the j'th node at level i. For example,
if there are 4 nodes in level 1 with 3, 3, 1, 2 children
respectively, then $n_i$ is 4 and $p^i$ is (3/9, 3/9, 1/9, 2/9).
To balance level 2, we need between two and three
children per parent.
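The worked example above can be checked with a short sketch. It computes the normalized entropy of one level from the number of children of each node and averages the per-level values. The function names and the handling of degenerate levels (a single node, or no children) are assumptions made for this illustration.

```python
import math

def level_balance(child_counts):
    """Normalized entropy of how children are spread over the nodes of one level."""
    n = len(child_counts)
    total = sum(child_counts)
    if n < 2 or total == 0:
        return 0.0  # assumption: a single-node or childless level contributes 0
    probs = [c / total for c in child_counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(n)

def sapling_balance(levels):
    """Average per-level balance; `levels` holds one child-count list per level."""
    if not levels:
        return 0.0
    return sum(level_balance(counts) for counts in levels) / len(levels)

# Level with 4 nodes that have 3, 3, 1 and 2 children respectively.
print(level_balance([3, 3, 1, 2]))
```

For the level with 3, 3, 1, 2 children the value is a little below 1; it reaches 1 only when every parent has the same number of children.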

Several features measure concept consistency and node 
uniqueness. Inconsistency can be computed by the number of 
conflicts (i.e. attaching node A to node B in one sapling and 
B to node A in another sapling); agreement is quantified by 
how many users create the same parent-child relation; node 
(or twig) uniqueness is computed by the ratio of unique node 
names to the total number of nodes in the sapling. Other 
features include sapling depth, breadth, number of nodes and 
terminal leaves it has, and the ratio of number of leaves to 
the total number of nodes in the sapling. 
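Two of these consistency features are simple enough to sketch directly. The example assumes that a user's relations are available as (parent, child) name pairs collected over all of the user's saplings; that representation and the function names are hypothetical.

```python
def count_conflicts(relations):
    """Count name pairs attached in both directions (A under B and B under A)."""
    edges = set(relations)
    # `parent < child` makes sure each conflicting pair is counted once
    return sum(1 for parent, child in edges
               if parent < child and (child, parent) in edges)

def node_uniqueness(node_names):
    """Ratio of unique node names to the total number of nodes in a sapling."""
    return len(set(node_names)) / len(node_names) if node_names else 0.0

relations = [("journey", "los angeles"),
             ("los angeles", "journey"),
             ("usa", "los angeles")]
print(count_conflicts(relations))
print(node_uniqueness(["journey"] + ["los angeles"] * 5))
```

The first call finds the 'journey' / 'los angeles' conflict described earlier, and the second shows how five identically named sets under one collection drive the uniqueness ratio down.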

Root-Diversity is an important hallmark of experts. Ex- 
perts create generalizable knowledge using categories that 
are meaningful to others. A vague concept, such as 'misc', 
'other', 'things', will mean different things to different people. 
Consequently, there will be little agreement about the child 
concepts of such root nodes, with every user specifying a 
different child. There is far more agreement about the children 
of more specific concepts, such as 'europe'. We quantify the
generalizability of a concept by the distribution of distinct
child nodes across all users.

Given a concept (sapling root), we extract all sub-concepts
users have specified as children of this root. Figure 2 shows
the distributions of unique children of the roots 'nature' and
'other stuff', sorted by frequency of occurrence. A peaked
distribution (Fig. 2(a)) indicates agreement among users about
sub-concepts and implies that the root concept is meaningful
to others. A flat distribution (Fig. 2(b)) implies there is little
agreement about the root concept, with practically each user
expressing a different sub-concept. This indicates that the
root concept is vague. We quantify the peakedness of the
distribution by measuring how many unique nodes are needed
to cover 30%, 50% and 70% of child nodes. For example, to
cover 70% of the distinct children of the root 'europe', we need
to look at 21.3% of the most frequent children, while to cover
the same fraction of children of 'other stuff', we need to look
at 64.6% of the most frequent children. Other root concepts in
our data set that are meaningful to many users include 'nature',
'animal', 'flower', 'bird', 'usa', 'sport', while the vaguer, less
meaningful concepts include 'location', 'subject', 'everything
else', 'landscape', 'random', 'stuff', and 'miscellaneous'.

Other features characterizing root diversity include the num- 
ber of people who have created a root node with that name, 
and the number of unique children the root has over all users. 
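One way to compute the coverage feature is sketched below. It assumes that, for a given root name, the child names specified by all users have been pooled into a single list with repetitions; the function name and that input format are assumptions for this example.

```python
from collections import Counter

def coverage_fraction(child_names, target):
    """Fraction of distinct children (most frequent first) needed to cover
    `target` (e.g. 0.7) of all child occurrences under a root concept."""
    counts = Counter(child_names)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank / len(counts)
    return 1.0

peaked = ["bird"] * 8 + ["flower", "tree"]   # users largely agree
flat = ["a", "b", "c", "d"]                  # every user names a different child
print(coverage_fraction(peaked, 0.7), coverage_fraction(flat, 0.7))
```

A small fraction signals a peaked distribution and therefore a root concept that is meaningful to many users; a fraction close to the target itself signals a flat distribution.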

C. Automatically Identifying Experts 

We collected saplings created by 7,121 Flickr users who 
were members of wildlife and nature photography public 
groups. We trained a model to use the features above to 
automatically identify experts among these users. We trained 
the model on a small set of manually labeled data and used it 
to label a larger test set. We then examined and labeled new 
predictions made by the model, added them to the training 
set and retrained the model. We iterated this self-training 
procedure on the unlabeled test data to discover new experts, 
and re-trained the model with the enriched data. 
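The self-training procedure can be outlined in a few lines. In this sketch `train` and `ask_annotator` are hypothetical callables standing in for model fitting and for manual inspection of a user's saplings; neither name comes from the original work, and the stopping rule (no new positive predictions, or a fixed number of rounds) is an assumption.

```python
def self_train(labeled, unlabeled, train, ask_annotator, max_rounds=8):
    """Grow a labeled set by manually checking the model's positive predictions.

    labeled:       dict mapping user -> True (expert) / False (novice)
    unlabeled:     iterable of users without labels
    train:         callable(labeled) -> predicate(user) -> bool
    ask_annotator: callable(user) -> bool, the manual label
    """
    labeled = dict(labeled)
    unlabeled = set(unlabeled)
    for _ in range(max_rounds):
        is_expert = train(labeled)
        candidates = {u for u in unlabeled if is_expert(u)}
        if not candidates:
            break  # the model proposes no new experts
        for user in candidates:
            labeled[user] = ask_annotator(user)  # manual check of each prediction
        unlabeled -= candidates
    return labeled
```

Only users the model predicts to be experts are sent for manual labeling, which is what keeps the labeling effort small when experts are rare.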

To create the initial training set, we asked three annotators to
review saplings created by 200 Flickr users randomly selected
from the set of 1000 who specified the most relations. Annotators
used the criteria above to identify experts. Each user's saplings
were laid out hierarchically using the yEd graph visualization
tool. Each annotator identified between 20 and 45 experts among the 200 users.
We treated 19 experts all annotators agreed upon as positive,
and the rest as negative, examples in the training set.

TABLE I

J48, Random Forest, and LibSVM model cross validation
results at each iteration. The size of the training set
increases at each iteration as positive predictions made by the
model are added to the training set.

TABLE II

Feature selection results, with features sorted by their
average rank.

Fig. 3. Relational affinity propagation (RAP): (a) two saplings being merged.
Dashed lines surround a group of nodes assigned to the same exemplar (in
orange), (b) Binary variable matrix corresponding to the configuration in (a),
(c) Factor graph formulation of binary RAP.

We trained three different models (J48 [14], Random
Forest [15], and LibSVM [16]) on the training set of 200
labeled users, and applied the models to classify unlabeled
test data. We aggregated positive predictions made by all three
models, manually labeled them, and iterated the procedure.
Table I reports results of cross validation at each iteration.
We reached 100% precision, 88% recall and 93% f-score with 
LibSVM after eight iterations and stopped at this point. After 
eight iterations, our training set had 315 users, of which 43 
were experts. Note that only a small fraction of all users can 
be classified as experts. Self-training enabled us to enrich the 
training set with positive examples without having to label 
thousands of users. Results of 10-fold cross validation of
LibSVM on labeled data were 84% precision, 65% recall, and
74% f-score. Applying the final model to the entire data set
identified 66 experts in total. 

To see which features are important, we used four feature
selection algorithms: SVM Attribute Evaluation [17], Relief
for Attribute Estimation [18], Information Gain Attribute
Evaluation [?], and Chi Squared Attribute Evaluation [19].
The SVM Attribute Evaluation method bases its decision function
on the support vectors of the borderline cases, while the others
base their decisions on the average cases. This difference
leads to different rankings of features. Relief evaluates the
importance of a feature by repeatedly sampling an instance
and estimating how well feature values distinguish among
instances near each other. Table II reports how different features
are ranked by these algorithms. All methods identify sapling
depth as the most important feature for identifying experts.
All methods besides SVM choose the number of leaves in the
sapling, and how balanced they are within the sapling, as the
next most important features. Generally, sapling-level features
are judged to be more important than user-level features by
all methods, similar to the intuitions of human annotators.

III. Using Expert Knowledge in Folksonomy 
Learning 

Plangprasopchok et al. [1] proposed a method to learn
folksonomies by clustering many saplings created by different
users. Their relational affinity propagation (RAP) is
a probabilistic method for clustering structured data into a
common deeper and bushier tree. RAP merges root nodes
of different saplings to extend the breadth of the learned
folksonomy, and it merges a child node of one sapling to the
root of another to extend its depth. RAP is based on affinity
propagation (AP) [2], and it identifies a set of exemplars that
best represent all the data. Exemplars emerge as messages
are passed between data items, with each item seeking an
assignment to the most similar exemplar. AP identifies a set
of exemplars, or clusters, which maximize the net similarity
between exemplars and data items assigned to them.

Following the binary AP framework of [20], let c be an N × N
matrix, where N is the number of data items. A binary variable
$c_{ij} = 1$ if node (data item) i is assigned to node j (i.e., j is
an exemplar of i); otherwise, $c_{ij} = 0$. AP uses constraints to
guide the inference process to ensure cluster consistency. The
first constraint, $I_i$, which is imposed on row i, indicates
that a data item can belong to only one exemplar ($\sum_j c_{ij} = 1$).
The second constraint, $E_j$, which is imposed on column
j, indicates that if an item other than j chooses j as its
exemplar, then j must be its own exemplar ($c_{jj} = 1$). AP
avoids assigning exemplars which violate these constraints.

A similarity function $S(\cdot)$ measures the similarity of a node
to its exemplar. If $c_{ij} = 1$, then we add $S(c_{ij})$ to the objective
function; otherwise, $S(c_{ij}) = 0$. The self-similarity, $S(c_{jj})$,
also called preference, is usually set to less than the maximum
similarity value in order to avoid creating a configuration with
N exemplars. In general, the higher the value of preference for
a particular item, the more likely it is to become an exemplar.
Setting all preferences to the same value indicates that all
items are equally likely to become exemplars. The global
objective function measures the quality of a configuration (i.e.,
exemplars and items assigned to them):

$$S(c_{11}, \ldots, c_{NN}) = \sum_{i,j} S_{ij}(c_{ij}) + \sum_{i} I_i(c_{i1}, \ldots, c_{iN}) + \sum_{j} E_j(c_{1j}, \ldots, c_{Nj}) \qquad (2)$$

A message passing algorithm [?] is used to find a configuration
that maximizes the net similarity without violating the I and E
constraints.
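To make the role of the two constraints concrete, the sketch below scores a given binary assignment matrix: it returns minus infinity when either constraint is violated and the sum of the selected similarities otherwise. This only evaluates a configuration; it is not the message passing procedure itself, and the function name and nested-list representation are assumptions for illustration.

```python
import math

def net_similarity(c, s):
    """Objective value of binary assignment matrix c under similarities s.

    Returns -inf when the one-exemplar constraint (I) or the
    exemplar-consistency constraint (E) is violated.
    """
    n = len(c)
    for i in range(n):
        if sum(c[i]) != 1:                     # I: exactly one exemplar per item
            return -math.inf
    for j in range(n):
        chosen_by_other = any(c[i][j] == 1 for i in range(n) if i != j)
        if chosen_by_other and c[j][j] != 1:   # E: an exemplar must choose itself
            return -math.inf
    return sum(s[i][j] for i in range(n) for j in range(n) if c[i][j] == 1)

s = [[-1, -2, -5],
     [-2, -1, -4],
     [-5, -4, -1]]                 # diagonal entries play the role of preferences
valid = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]      # items 0 and 1 share exemplar 0
invalid = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]    # 0 and 1 pick each other
print(net_similarity(valid, s), net_similarity(invalid, s))
```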



A. Relational Affinity Propagation 






In order to cluster structured data into a tree, Plangprasopchok
et al. [1] introduced a new "single parent" constraint. The
F-constraint allows a node to select another as an exemplar
only if their parents belong to the same exemplar, thus ensuring
that the learned structure forms a tree. Consider clustering
structured data in Fig. 3(a), where exemplars are in orange, and
dashed lines surround nodes assigned to the same exemplar.
When child nodes i and k decide whether to merge with node
j, the F-constraint checks whether their parents h and m
belong to the same exemplar. Figure 3(b) shows the binary
variable matrix corresponding to the configuration in (a). This
configuration is undesirable since it does not correspond to
a tree: nodes i and k are assigned to exemplar j, but their
parents belong to different exemplars.

In its original formulation, the F-constraint was imposed on 
child nodes only and could result in undesirable configurations. 
The F-constraint checks whether i and k can be assigned 
to j, and since they cannot, it forces them into separate 
clusters. While the configuration is valid, it leads to a shallow 
folksonomy. We modify the F-constraint to prevent such 
situations. The modified F-constraint is imposed on both child 
and parent nodes, if the parent node is also an exemplar: 



$$F_j(c_{1j}, \ldots, c_{Nj}) = \begin{cases} -\infty & \exists\ \text{child } i : c_{ij} = 1 \text{ and } ex(pa(i)) \neq ex(pa(ne(j))) \\ 0 & \text{otherwise} \end{cases}$$

where ne(·) returns the set of nodes that share the exemplar of its
argument, pa(·) returns the index of the parent of its argument,
and ex(·) returns the index of the argument's exemplar. In
the illustration in Fig. 3, suppose that k is found to be
similar enough to j so that they can be merged. To decide
whether i too can choose j as an exemplar, the modified
F-constraint checks whether the parent exemplar of node i is
the same as the parent exemplar of any of j's neighbors. If
not, i won't be able to pick j as an exemplar. The objective
function in Eq. (2) is modified by the addition of the new term
$F_j(c_{1j}, \ldots, c_{Nj})$; we use the max-sum method to optimize it.
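The check performed by the single-parent constraint can be illustrated on an explicit assignment. The sketch below assumes dictionaries `exemplar` (node to its current exemplar) and `parent` (node to its parent, or None for roots); these structures, the function name, and the decision to leave root nodes unrestricted are assumptions for this example rather than part of the original formulation.

```python
def violates_single_parent(i, j, exemplar, parent):
    """True if assigning node i to exemplar j would give j's cluster
    members whose parents belong to different exemplars."""
    def parent_exemplar(node):
        p = parent.get(node)
        return exemplar.get(p) if p is not None else None

    target = parent_exemplar(i)
    if target is None:
        return False  # assumption: root nodes are not restricted
    for k, e in exemplar.items():
        if e == j and k != i:
            pe = parent_exemplar(k)
            if pe is not None and pe != target:
                return True
    return False

# Setting of Fig. 3(a): i has parent h, k has parent m, and k already merged with j.
parent = {"i": "h", "k": "m", "j": None, "h": None, "m": None}
exemplar = {"h": "h", "m": "m", "j": "j", "k": "j", "i": "i"}
print(violates_single_parent("i", "j", exemplar, parent))
```

With h and m in different clusters the call reports a violation; once m is assigned to exemplar h, node i is free to choose j.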



B. Integrating Expert Knowledge 

RAP provides a framework to integrate experts' knowledge 
in folksonomy learning. We do this simply by giving the nodes 
from saplings created by experts higher preference, or self- 
similarity, values. This means that these nodes will be more 
likely to become exemplars, and expert knowledge will guide 
the folksonomy learning process. 



C. Implementing RAP 

Binary RAP may be written as a factor graph shown
in Fig. 3(c). Following Ref. [21] and Ref. [1], we derived
message update formulas for $\beta$, $\eta$, $\alpha$, $\rho$, $\tau$ and $\sigma$:

$$\alpha_{ij} = \begin{cases} \sum_{k \neq j} \max[\rho_{kj}, 0] & i = j \\ \min\left[0,\ \rho_{jj} + \sum_{k \notin \{i,j\}} \max[\rho_{kj}, 0]\right] & i \neq j \end{cases}$$

$$\ldots = \begin{cases} \ldots \max[\sigma_{kj}, 0] & i = j \\ \min\left[0,\ \rho_{jj} + \sum_{k \notin \{i,j\},\ k \in S^{pa(ne(j))}} \max[\sigma_{kj}, 0]\right] & i \neq j \end{cases}$$

$$\ldots = s(i,j) + \eta_{ij} + \alpha_{ij}. \qquad (8)$$

In the message updates above, $S^{pa(ne(j))}$ represents the set of nodes sharing
the same parent exemplar as the neighbors of j. Note that we do not
need to check all neighbors of j, but just one child node among
all neighbors, since all nodes in ne(j) must already share a
parent exemplar. These message update equations make
our model favor valid configurations, which maximize the
objective function $S(c_{11}, \ldots, c_{NN})$. Since message passing
algorithms can be written in max-sum form, they can be easily
parallelized on multi-core computers [22]. We implemented
the message update formulas using the map-reduce parallel
programming framework [23], which ran on a cluster of 30+ nodes.

IV. Experimental Results 

We measured the impact of expert knowledge on the
folksonomies learned from Flickr data. Our data set consists of
20,759 saplings created by 7,121 users. A node can be a
collection or a set. The tags of all photos within a set are assigned
to the set node and propagated to the collection node. We
stemmed all terms (tags, set and collection names) using the
Porter stemming algorithm and measured similarity between
a pair of nodes i and j by the number of common tags $t_{ij}$
they have among their top 40 tags: $S(i, j) = \min(1, \ldots)$.
We infer exemplars and clusters by initializing all messages
to zero, and update exemplar assignments at each iteration
until convergence. We check convergence by monitoring the
number of exemplars and the stability of net similarity.

We selected 31 seed terms consistent with Ref. [1] and
generated folksonomies for these seed terms using RAP with
and without expert knowledge. To learn a folksonomy, we first
need to select relevant saplings from the data set. We created a
snowball sample of relevant saplings as follows. For the seed
term that will be the root of the learned folksonomy, first we
retrieve all saplings whose root has the same name as the seed
term. We then retrieve saplings whose root has the same name
as one of the children in the first set of saplings, and so on.
We include expert knowledge in one of two ways: (1) using the
snowball sample of relevant saplings, including those created
by the 66 experts the model identified; (2) in addition to these,
using all saplings created by the experts in the snowball sample.
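The snowball sampling step amounts to a graph traversal over sapling names. The sketch below assumes each sapling is given as a root name plus the names of its direct children; this flattened representation and the function name are assumptions for illustration.

```python
def snowball_sample(seed, saplings):
    """Return indices of saplings reachable from `seed` by repeatedly
    matching child names against the root names of other saplings.

    saplings: list of (root_name, [child_names]) pairs
    """
    by_root = {}
    for idx, (root, _) in enumerate(saplings):
        by_root.setdefault(root, []).append(idx)

    selected, seen, frontier = [], {seed}, [seed]
    while frontier:
        name = frontier.pop()
        for idx in by_root.get(name, []):   # saplings rooted at this name
            selected.append(idx)
            for child in saplings[idx][1]:
                if child not in seen:       # follow each name only once
                    seen.add(child)
                    frontier.append(child)
    return sorted(selected)

saplings = [("africa", ["south africa", "kenya"]),
            ("south africa", ["cape town"]),
            ("europe", ["france"]),
            ("cape town", ["table mountain"]),
            ("africa", ["safari"])]
print(snowball_sample("africa", saplings))
```

Starting from 'africa', the sample picks up both 'africa' saplings, then 'south africa' and 'cape town', and never reaches the unrelated 'europe' sapling.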

Fig. 4. Folksonomies for 'africa' learned (a) without and (b) with expert
knowledge (expert nodes in orange).

Besides varying the amount of expert knowledge used by
the learning algorithm, we can also vary its weight. We used
the following strategies to vary the emphasis placed on expert
knowledge: (1) treat all users uniformly by setting preference
values of all nodes to the mean of similarity scores (ordinary
RAP); (2) set preference values of expert nodes to twice the
mean, while all other preference values are set to the mean.
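The two weighting strategies differ only in how the preference vector is filled in. A minimal sketch, assuming pairwise similarities are stored in a dictionary keyed by node pairs and nodes are numbered from 0 (both assumptions for this example):

```python
def preference_values(similarities, expert_nodes, num_nodes, expert_factor=2.0):
    """Preferences (self-similarities): the mean similarity for ordinary nodes
    and expert_factor * mean for nodes that come from experts' saplings."""
    mean_sim = sum(similarities.values()) / len(similarities)
    return [expert_factor * mean_sim if node in expert_nodes else mean_sim
            for node in range(num_nodes)]

sims = {(0, 1): 0.2, (1, 2): 0.4, (0, 2): 0.6}
uniform = preference_values(sims, expert_nodes=set(), num_nodes=3)   # strategy (1)
weighted = preference_values(sims, expert_nodes={1}, num_nodes=3)    # strategy (2)
print(uniform, weighted)
```

Passing an empty expert set reproduces ordinary RAP, while a different `expert_factor` gives other multiples of the mean.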

As an illustration, consider a portion of the 'Africa' folksonomy,
shown in Fig. 4(a), learned using saplings such as those in
Fig. 1 but without differentiating between expert and novice
users. The root has a child 'Christmas', because some people
spent their Christmas holidays in Africa. Since 'Christmas'
is linked to many other concepts such as 'family', 'card',
etc., it introduces irrelevant concepts into the 'Africa' folksonomy.
Figure 4(b) shows a portion of the 'Africa' folksonomy learned
with expert knowledge. Now the 40 nodes ('xmas', 'family',
'card', etc.) originally placed under 'Africa' → 'Christmas'
were moved to 'holiday' → 'Christmas'. Moreover, 'Table
Mountain' and other nodes under 'Africa' → 'Cape Town'
were moved under 'Africa' → 'South Africa' → 'Cape Town'.
As we can see from this illustration, adding expert knowledge
helps produce a more relevant and detailed folksonomy.

A. Automatic Evaluation 



Table III reports results of running RAP in three different
settings for 31 seed terms: (M1) relevant saplings collected by
the snowball sample with no differentiation between novice
and expert users (all preference values set to the mean); (M2)
using relevant saplings plus all other saplings from experts,
with no differentiation between users (mean + EXP); (M3)
same saplings as before, but with higher preference values
for experts (2*mean + EXP). While the learning algorithm
generally produces several trees, we evaluate only the most
'popular' tree, one that aggregates the greatest number of
saplings. The popular tree learned by M1 contained between
14 and 7925 nodes (2001.26 on average), and that learned by
M2 between 16 and 8114 nodes (1947.87 on average), while
folksonomies learned by M3 were smaller, between 14 and
5667 nodes (1292.81 on average).

TABLE III

Evaluation of folksonomies learned for 31 (stemmed) seed terms

We automatically measure the quality of the learned
folksonomies by comparing them to the reference taxonomy
from the Open Directory Project (ODP) [24]. We applied
two metrics: Lexical Precision (LP) and Taxonomic Overlap
(TO) [25]. LP measures term overlap between the learned and
reference taxonomies, independent of their structure, while
TO measures the overlap of ancestors and descendants of
a pair of terms from the learned and reference taxonomies
without considering their order. We also measure the depth
of the taxonomy. We observe that while RAP leads to few
or no structural inconsistencies, integrating expert knowledge
into the learning process improves the quality of the learned
taxonomies (higher LP and TO scores) and how detailed
they are (greater depth), while also removing irrelevant nodes
(smaller trees).
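For orientation, Lexical Precision is commonly defined as the share of terms in the learned taxonomy that also occur in the reference taxonomy. The sketch below assumes that definition and takes plain collections of (stemmed) term names; it is not taken from the evaluation code used here.

```python
def lexical_precision(learned_terms, reference_terms):
    """Share of the learned taxonomy's terms that also occur in the reference."""
    learned, reference = set(learned_terms), set(reference_terms)
    return len(learned & reference) / len(learned) if learned else 0.0

learned = ["africa", "kenya", "xmas", "safari"]
reference = ["africa", "kenya", "safari", "egypt"]
print(lexical_precision(learned, reference))
```

Because the measure ignores structure, removing irrelevant terms such as 'xmas' from the learned tree raises LP even when the remaining hierarchy is unchanged.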

Is expert knowledge alone sufficient to produce high quality
folksonomies? The last column in Table III shows the
percentage of nodes in the learned folksonomy that can be
attributed to experts. On average, this fraction is less than 10%.
We conclude that integrating knowledge from both expert and
novice users leads to more comprehensive folksonomies than
using expert knowledge alone.

B. Manual Evaluation 

The automatic evaluation was not comprehensive, since it can only
evaluate portions of the learned folksonomies that use the
vocabulary of the reference taxonomy. Therefore, we also
carried out a manual evaluation using the Coding Analysis
Toolkit (CAT) [26], which provides a Web interface for users
to answer customized questions. Each question presented
the user with a portion of the learned folksonomy, laid out as a
tree, and asked if it was correct. Since the trees were generally
very large, we reduced their size as follows. For a pair of
folksonomies learned by methods M1 and M3 for some seed
term, we identified leaf nodes with the same name and the
same ancestors in the two trees and removed them from both
trees. Applying this strategy iteratively eliminated on average
50% to 70% of the nodes. If the reduced tree was still large, we
segmented it into disjoint subtrees with at most 10 child nodes
at any level. We asked five annotators to determine whether
each reduced tree (or subtree) was correct (837 questions
total). Overall, annotators judged 45.30% of the trees learned
by method M1 and 68.24% learned by M3 to be correct.
Thus, using expert knowledge leads to better folksonomies.
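The iterative tree reduction can be sketched by representing each tree as a set of root-to-node paths, so that a leaf with the same name and the same ancestors in both trees is simply a path that is a leaf in both sets. The path representation, the function names, and the choice never to remove the root are assumptions for this example.

```python
def leaves(paths):
    """Paths (tuples of names from the root) that have no children."""
    parents = {p[:-1] for p in paths}
    return {p for p in paths if p not in parents}

def reduce_pair(tree_a, tree_b):
    """Iteratively drop leaves with the same name and ancestors in both trees."""
    a, b = set(tree_a), set(tree_b)
    while True:
        shared = leaves(a) & leaves(b)
        shared = {p for p in shared if len(p) > 1}  # assumption: keep the root
        if not shared:
            return a, b
        a -= shared
        b -= shared

tree_m1 = {("africa",), ("africa", "kenya"), ("africa", "kenya", "safari"),
           ("africa", "christmas"), ("africa", "christmas", "card")}
tree_m3 = {("africa",), ("africa", "kenya"), ("africa", "kenya", "safari"),
           ("africa", "south africa"), ("africa", "south africa", "cape town")}
reduced_m1, reduced_m3 = reduce_pair(tree_m1, tree_m3)
print(sorted(reduced_m1))
print(sorted(reduced_m3))
```

After two rounds the shared 'kenya' branch disappears from both trees, leaving only the parts on which the two methods disagree.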

We calculated the statistical significance of the results of automatic
and manual evaluation. We find that the difference in TO
scores between RAP without and with expert knowledge is
significant at the 95% level with t(31)=2.265, p < 0.05. Moreover,
RAP with expert knowledge improves correctness by 23 percentage
points (from 45.30% to 68.24%) on the manual annotation task.
We believe that combining automatic and manual evaluation
leads to a convincing evaluation of folksonomy learning.

C. Robustness 




Fig. 5. Robustness of proposed method, as measured by the taxonomic
overlap (TO), with respect to (a) preference values and (b) percentage of
experts misidentified.



Finally, we address robustness of the method with respect
to changes in the preference values assigned to expert nodes.
We ran our algorithm for six preference values of the form
x*mean, where x ∈ {0, 0.5, 1.0, 1.5, 2.0, 3.0}. We report
TO scores for three seeds ('invertebrate', 'africa', and 'bird')
in Fig. 5 (left). The quality of the learned folksonomies, as
measured by TO, rises with preference values, and saturates
around x = 2.0.

Another question is how the accuracy of automatic expert 
identification affects the quality of the learned folksonomies. 



For this experiment, we randomly selected n% of expert nodes 
and swapped their preference values with those of the same number 
of randomly selected novice nodes. We varied the percentage of 
swapped nodes from 0% to 100% and report TO scores for the 
three learned folksonomies in Fig. 5 (right). As we increased 
the number of swapped nodes, TO scores dropped by 9%-12%. 
Note that when all expert nodes were swapped for 
novice nodes, i.e., random novice nodes had their preference 
values set to 2*mean, the TO scores were similar to those of the 
method that did not differentiate between expert and novice nodes. 
The difference between 0% and 100% swapped is similar to the 
difference between RAP with and without expert knowledge, as 
expected. We conclude that even moderately high errors (up to 50%) 
in expert identification do not significantly degrade the quality 
of the learned folksonomies. 
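
The swapping procedure can be sketched as below. This is a minimal illustration under stated assumptions: preference values are held in a dictionary keyed by node name, the function swap_expert_preferences is hypothetical, and the values 2.0 and 1.0 only stand in for 2*mean and a baseline preference.

```python
import random


def swap_expert_preferences(prefs, experts, fraction, rng=None):
    """Swap preferences of a fraction of expert nodes with random novices.

    prefs    -- dict mapping node name to preference value
    experts  -- set of node names identified as experts
    fraction -- share of expert nodes to swap, between 0.0 and 1.0
    """
    rng = rng or random.Random()
    novices = [node for node in prefs if node not in experts]
    # Cannot swap more experts than there are novices to swap with.
    k = min(round(fraction * len(experts)), len(novices))
    chosen_experts = rng.sample(sorted(experts), k)
    chosen_novices = rng.sample(novices, k)
    swapped = dict(prefs)  # leave the input untouched
    for expert, novice in zip(chosen_experts, chosen_novices):
        swapped[expert], swapped[novice] = prefs[novice], prefs[expert]
    return swapped


# Illustrative values only: experts start at 2.0, novices at 1.0.
prefs = {"a": 2.0, "b": 2.0, "c": 1.0, "d": 1.0, "e": 1.0}
experts = {"a", "b"}
all_swapped = swap_expert_preferences(prefs, experts, 1.0, random.Random(0))
```

With fraction set to 1.0, every expert node ends up with a novice preference and the same number of randomly chosen novices carry the expert preference, which corresponds to the 100% swapped condition.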

V. Related Work 

Expert identification has been addressed by researchers in 
several different fields. Existing works analyze the (textual) 
content of documents people create, the link structure of 
the interactions between people, or a combination of both 
methods. Zhang et al. proposed a probabilistic algorithm 
to find experts on a given topic by using local information 
about a person (e.g., publications) and relationships between 
people. A similar approach was used by Maybury to find 
experts within organizations from the documents (publications, 
publicly shared folders) they create and relations between them 
(project information, citations). Balog et al. used generative 
language models to identify experts among authors of 
documents, while Deng et al. [8] explored topic-based models 
for finding experts in academic fields. Davitz et al. used 
network analysis tools to identify experts based on the documents 
or email messages they create within their organizations. 
Content quality analysis in social media has also been widely 
investigated. Agichtein et al. [27] investigated methods 
for measuring content quality using content and user-relationship 
features. Hu et al. [28] proposed a quality assessment model using 
the interaction data between articles and their contributors. Our 
approach is similar in spirit, in that we look at the contents of 
data people create to identify experts, although we have not yet 
included relations between people in the analysis. Unlike these 
earlier methods, we use the structure of annotations to measure 
users' expertise on a topic. While Korner et al. proposed 
a method to differentiate users in social tagging systems, they 
classify users as categorizers and describers based on their 
tag usage, and show that there is more semantic agreement 
between describers. They do not attempt to learn taxonomies 
or differentiate the quality of annotations. 

With the advent of crowdsourcing services, labeling large 
datasets has become easier. However, due to variations in 
annotators' abilities, significant post-processing is required. 
To address this problem, Welinder et al. [29] proposed a 
labeling strategy based on estimating the most likely values 
of the current labels and the annotators' abilities. Sheng et al. [30] 
studied repeated-labeling strategies to improve label quality. 
Our work differs in that on the Social Web users 
freely choose the content to label, as well as the labels themselves 
(tags, directories), which reflect their own interest in the content. 
Our work is also related to broader efforts to "crowdsource" 
knowledge production, embodied, for example, by "citizen 
science" projects and "wisdom of crowds" approaches. 
Researchers have studied methods that aggregate data of 
varying quality [32], [33]. However, the amount and variation 
of data in these studies was limited. Our approach can 
automatically identify the quality of data and aggregate it from 
thousands of users. 

VI. Conclusion 

In this paper, we propose a framework to automatically 
identify experts based on the linguistic and structural features 
of the annotations they create, and use experts' annotations to 
guide the folksonomy learning process. We show that using 
experts' knowledge can produce more accurate and detailed 
folksonomies. We also show that the proposed method is robust 
to errors in expert identification. Our work generalizes beyond 
Flickr to other structured data sources (eBay categories, Delicious 
bundles, Bibsonomy relations, file systems). 

In future work, we would like to extend the automatic expert 
identification procedure using a Bayesian approach [31], so that 
expertise can be modeled as a continuous variable rather than 
a binary one. By characterizing experts in more detail, 
we could control the degree to which experts' knowledge is 
used. We would also like to extend RAP to other 
structure learning problems, such as alignment of biological 
data. Finally, we would like to incorporate a more efficient 
inference algorithm and compare the approach to other statistical 
relational learning approaches. 